Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add nds_h validation script #198

Merged
merged 3 commits into from
Oct 23, 2024
Merged

Conversation

yinqingh
Copy link
Collaborator

This PR is to add a validation script for nds_h queries to compare outputs between CPU and GPU. The main checking logic is the same with nds_validate.py but remove some special checks for NDS queries.

=== Comparing Query: query4 ===
Collecting rows from DataFrame
Collected 5 rows in 0.16502952575683594 seconds
Collecting rows from DataFrame
Collected 5 rows in 0.173628568649292 seconds
Processed 5 rows
Results match

Signed-off-by: Yinqing Hao <[email protected]>
@yinqingh
Copy link
Collaborator Author

Observed inconsistent results for query18. Need further investigations on this.

Row 22:
['Customer#079420369', 79420369, 15970070725, datetime.date(1995, 1, 15), Decimal('569974.15'), Decimal('324.00')]
['Customer#079420369', 79420369, 7380136135, datetime.date(1995, 1, 15), Decimal('569974.15'), Decimal('324.00')]

Processed 100 rows
There were 1 errors
=== Unmatch Queries: ['query18'] ===

@wjxiz1992
Copy link
Collaborator

I rerun this query with the same data(we synced offline) and got the same diff, I'm afraid it's a rapids bug. I'll file a bug at rapids side.

@yinqingh yinqingh force-pushed the yinqing-ndsh-validate branch from a3cb119 to 574de5a Compare October 17, 2024 03:35
Returns:
bool: True if result matches otherwise False
"""
if query_name in SKIP_QUERIES:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skip checking query15_part1 and query15_part3 since these are create/drop view queries and no output for these queries.

use_iterator: bool):
# skip output for specific query columns
if query_name in SKIP_COLUMNS:
df = df.drop(*SKIP_COLUMNS[query_name])
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop column o_orderkey of the output of query 18 due to the non-deterministic results

@yinqingh yinqingh changed the title [WIP] Add nds_h validation script Add nds_h validation script Oct 17, 2024
import os
import re
import time
from decimal import *
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: too wide import?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thanks!

from decimal import *

from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.types import *
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same as above

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thanks!

df = df.drop(*SKIP_COLUMNS[query_name])

# apply sorting if specified
non_float_cols = [col(field.name) for \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    non_float_cols = [col(field.name) for field in df.schema.fields 
                      if field.dataType not in (FloatType(), DoubleType())]
    float_cols = [col(field.name) for field in df.schema.fields 
                  if field.dataType in (FloatType(), DoubleType())]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thanks!

return math.isclose(expected, actual, rel_tol=epsilon)
elif isinstance(expected, str) and isinstance(actual, str):
return expected == actual
elif expected == None and actual == None:
Copy link
Collaborator

@pxLi pxLi Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these were covered in below return ``expected == actual`

Copy link
Collaborator Author

@yinqingh yinqingh Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated this function to make it more readable and I think it should work as the same as previous one. Please let me know if some special cases are not covered in this function. Thanks! cc @wjxiz1992

def compare(expected, actual, epsilon=0.00001):
    #TODO 1: we can optimize this with case-match after Python 3.10
    #TODO 2: we can support complex data types like nested type if needed in the future.
    #        now NDS only contains simple data types.
    if isinstance(expected, float) and isinstance(actual, float):
        # Double is converted to float in pyspark...
        if math.isnan(expected) and math.isnan(actual):
            return True
        return math.isclose(expected, actual, rel_tol=epsilon)

    if isinstance(expected, Decimal) and isinstance(actual, Decimal):
        return math.isclose(expected, actual, rel_tol=epsilon)

    return expected == actual

raise Exception(f"More than one summary file found for query {query_name} in folder {prefix}.")
if len(file_glob) == 0:
raise Exception(f"No summary file found for query {query_name} in folder {prefix}.")
for filename in file_glob:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if len(file_glob) > 1: should have already errored out.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Thanks!

Signed-off-by: Yinqing Hao <[email protected]>
@yinqingh yinqingh force-pushed the yinqing-ndsh-validate branch from 56fc6f2 to 25a0cdd Compare October 17, 2024 05:55
Copy link
Collaborator

@wjxiz1992 wjxiz1992 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@wjxiz1992 wjxiz1992 merged commit a263915 into NVIDIA:dev Oct 23, 2024
2 checks passed
@yinqingh yinqingh deleted the yinqing-ndsh-validate branch October 23, 2024 03:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants